Skip to content

[Fix] Integer overflow in network frontend causes premature termination of simulation with empty end-to-end results#127

Open
WWeiOne wants to merge 2 commits into
aliyun:masterfrom
WWeiOne:master
Open

[Fix] Integer overflow in network frontend causes premature termination of simulation with empty end-to-end results#127
WWeiOne wants to merge 2 commits into
aliyun:masterfrom
WWeiOne:master

Conversation

@WWeiOne

@WWeiOne WWeiOne commented May 19, 2025

Copy link
Copy Markdown
Contributor

Problem Description

recvHash and all_sent_chunksize got integer overflow when the data size in collective communication is large, and there are implicit type conversions, leading to an early stop without an end-to-end result.

map<std::pair<int, std::pair<int, int>>, int> recvHash;
uint64_t count = recvHash[make_pair(tag, make_pair(t.src, t.dest))]

int all_sent_chunksize;
notify_sender_sending_finished(sid, did, all_sent_chunksize, flowTag);
void notify_sender_sending_finished(int sender_node, int receiver_node, **uint64_t** message_size, AstraSim::ncclFlowTag flowTag)

Observation

file: ncclFlowModel_EndToEnd.csv empty
file: SimAI.log, strange count, size and message size. and early stop.
Pasted image 20250519023943

Minimal Reproduction

Workload

HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 36 ga: 1 all_gpus: 4 checkpoints: 0 checkpoint_initiates: 0 pp_comm: 0
2
grad_gather	-1	1	NONE	0	1	NONE	0	1	ALLGATHER	6467616768	100
grad_param_comm	-1	1	NONE	0	1	NONE	0	1	REDUCESCATTER	12935233536	100

Topology

7 2 2 1 8 H100
4 5 6 
0 4 2880Gbps 0.000025ms 0
0 6 100Gbps 0.0005ms 0
1 4 2880Gbps 0.000025ms 0
1 6 100Gbps 0.0005ms 0
2 5 2880Gbps 0.000025ms 0
2 6 100Gbps 0.0005ms 0
3 5 2880Gbps 0.000025ms 0
3 6 100Gbps 0.0005ms 0

Change Made:

  1. Update type of recvHash and all_sent_chunksize.
  2. Fix the relevant output log format specifiers and update to consistent language.

@CLAassistant

CLAassistant commented May 19, 2025

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@zyksir

zyksir commented May 30, 2025

Copy link
Copy Markdown
Collaborator

This commit looks good to me @Huoyuan100861

@gabrielecastellano

Copy link
Copy Markdown
Contributor

This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint_64?

@WWeiOne

WWeiOne commented Jun 4, 2025

Copy link
Copy Markdown
Contributor Author

This indeed solves a problem that was hard to detect. Maybe also change total_bytes (in qp_finish) and all_sent_chunksize (in send_finish) to uint_64?

Good point! all_sent_chunksize has already been updated. I’ve now changed total_bytes to uint64, making it consistent with the type of q->m_size.

@WWeiOne

WWeiOne commented Jun 5, 2025

Copy link
Copy Markdown
Contributor Author

@zyksir @Huoyuan100861 All updates are in, ready for review

@gabrielecastellano

Copy link
Copy Markdown
Contributor

This very simple bugfix is from seven months ago, why was not merged yet?

@gabrielecastellano

Copy link
Copy Markdown
Contributor

Review required by @zyksir or @Huoyuan100861

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants